Do Generalisation Results Generalise?
Boglioni, Matteo, Sgobbi, Andrea, Tavernini, Gabriel, Rita, Francesco, Mosbach, Marius, Pimentel, Tiago
A large language model's (LLM's) out-of-distribution (OOD) generalisation ability is crucial to its deployment. Previous work assessing LLMs' generalisation performance, however, typically focuses on a single out-of-distribution dataset. This approach may fail to precisely evaluate the capabilities of the model, as the data shifts encountered once a model is deployed are much more diverse. In this work, we investigate whether OOD generalisation results generalise. More specifically, we evaluate a model's performance across multiple OOD testsets throughout a finetuning run; we then compute the partial correlation of performances across these testsets, regressing out in-domain performance. This allows us to assess how correlated generalisation performances are once in-domain performance is controlled for. Analysing OLMo2 and OPT, we observe no overarching trend in generalisation results: the existence of a positive or negative correlation between any two OOD testsets depends strongly on the specific choice of model analysed.
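The partial-correlation analysis described in the abstract can be sketched in a few lines: regress each test set's performance curve on the in-domain curve, then correlate the residuals. This is a minimal illustration of the idea only, not the authors' actual analysis code (function names and the single-regressor setup are assumptions):

```python
import math

def partial_correlation(x, y, z):
    """Partial correlation of x and y, controlling for z: correlate the
    residuals of regressing each of x and y on z (with intercept)."""
    def residuals(a, b):
        # Least-squares residuals of regressing a on b.
        n = len(a)
        mb, ma = sum(b) / n, sum(a) / n
        cov = sum((bi - mb) * (ai - ma) for ai, bi in zip(a, b))
        var = sum((bi - mb) ** 2 for bi in b)
        slope = cov / var
        return [ai - (ma + slope * (bi - mb)) for ai, bi in zip(a, b)]
    rx, ry = residuals(x, z), residuals(y, z)
    num = sum(a * b for a, b in zip(rx, ry))
    den = math.sqrt(sum(a * a for a in rx) * sum(b * b for b in ry))
    return num / den
```

Here `x` and `y` would be two OOD testsets' accuracy curves over a finetuning run, and `z` the in-domain curve being regressed out.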
Submission 180: Author Response
We thank the reviewers for their thoughtful comments. Reviewers have described our work as "extremely important in that it provides a reality check for [...]". Reviewers' comments have been paraphrased for brevity.
- R3: It looks like the random image regularizer hurts in-domain performance.
- R3: Do other VQA datasets (e.g., GQA, VCR) have the same problem?
- R2: Do other datasets for OOD evaluation have similar problems like VQA-CP?
Proximal Supervised Fine-Tuning
Zhu, Wenhong, Xie, Ruobing, Wang, Rui, Sun, Xingwu, Wang, Di, Liu, Pengfei
Supervised fine-tuning (SFT) of foundation models often leads to poor generalization, where prior capabilities deteriorate after tuning on new tasks or domains. Inspired by trust-region policy optimization (TRPO) and proximal policy optimization (PPO) in reinforcement learning (RL), we propose Proximal SFT (PSFT), a fine-tuning objective that incorporates the benefits of a trust region, effectively constraining policy drift during SFT while maintaining competitive tuning performance. By viewing SFT as a special case of policy gradient methods with constant positive advantages, we derive PSFT, which stabilizes optimization and improves generalization, while leaving room for further optimization in subsequent post-training stages. Experiments across mathematical and human-value domains show that PSFT matches SFT in-domain, outperforms it in out-of-domain generalization, remains stable under prolonged training without causing entropy collapse, and provides a stronger foundation for subsequent optimization. Recently, post-training has become a crucial part of the overall training process. In particular, reinforcement learning (RL) algorithms, such as PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024), have demonstrated significant effectiveness when applied to language models (LMs) focused on reasoning tasks. As RL is scaled over time, foundation models gain the capacity to address complex problems through more profound and extended reasoning (OpenAI, 2024; Guo et al., 2025). These reasoning models offer abundant and valuable latent thoughts (Ruan et al., 2025) across the internet.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
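Viewing SFT as policy gradient with a constant advantage of 1, the trust-region idea in the abstract can be sketched as a PPO-style clipped objective over the token probability ratio. The function below is a hedged per-token illustration only; the name, the clipping threshold, and the exact form are assumptions, not the authors' implementation:

```python
import math

def psft_token_objective(logp_new, logp_old, eps=0.2):
    # Probability ratio between the current policy and the frozen
    # pre-update (reference) policy for one target token.
    ratio = math.exp(logp_new - logp_old)
    # PPO-style clipped surrogate with constant advantage A = 1:
    # the objective stops growing once the ratio exceeds 1 + eps,
    # constraining policy drift during fine-tuning.
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return -min(ratio, clipped)  # negated, so lower is better
```

When `logp_new == logp_old` the ratio is 1 and clipping is inactive; taking `eps` very large recovers an unclipped, vanilla policy-gradient view of SFT.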
LARGO: Low-Rank Regulated Gradient Projection for Robust Parameter Efficient Fine-Tuning
Zhang, Haotian, Liu, Liu, Yu, Baosheng, Qiu, Jiayan, Ren, Yanwei, Liu, Xianglong
The advent of parameter-efficient fine-tuning methods has significantly reduced the computational burden of adapting large-scale pretrained models to diverse downstream tasks. However, existing approaches often struggle to achieve robust performance under domain shifts while maintaining computational efficiency. To address this challenge, we propose the Low-rAnk Regulated Gradient Projection (LARGO) algorithm, which integrates dynamic constraints into low-rank adaptation methods. Specifically, LARGO incorporates parallel trainable gradient projections to dynamically regulate layer-wise updates, retaining the out-of-distribution robustness of the pretrained model while preserving inter-layer independence. Additionally, it ensures computational efficiency by mitigating the influence of gradient dependencies across layers during weight updates. Moreover, we incorporate an SVD-based initialization strategy that leverages the singular value decomposition of the pretrained weights for structured initialization, minimizing deviation from pretrained knowledge. Through extensive experiments on diverse benchmarks, LARGO achieves state-of-the-art performance across in-domain and out-of-distribution scenarios, demonstrating improved robustness under domain shifts with significantly lower computational overhead compared to existing PEFT methods. The source code will be released soon.
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
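The SVD-based initialization mentioned in the abstract can be sketched as follows: factor the pretrained weight and seed the low-rank adapter from its top singular directions, so training starts close to the pretrained solution. This is a hypothetical illustration under assumed names and an assumed even split of the singular values between the two factors:

```python
import numpy as np

def svd_adapter_init(W, rank):
    # Truncated SVD of the pretrained weight matrix W.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    root_s = np.sqrt(S[:rank])
    # Split the top-`rank` singular factors between the two adapter
    # matrices so that A @ B reproduces W's dominant subspace.
    A = U[:, :rank] * root_s            # shape (out_dim, rank)
    B = root_s[:, None] * Vt[:rank]     # shape (rank, in_dim)
    return A, B
```

With `rank` equal to the full rank of `W`, the product `A @ B` reconstructs `W` exactly; smaller ranks keep only the dominant directions, minimizing deviation from the pretrained weights for a given budget.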
Compressing Language Models for Specialized Domains
Williams, Miles, Chrysostomou, George, Jeronymo, Vitor, Aletras, Nikolaos
Compression techniques such as pruning and quantization offer a solution for more efficient deployment of language models (LMs), albeit with small drops in benchmark performance. However, general-purpose LM compression methods can negatively affect performance in specialized domains (e.g., biomedical or legal). Recent work has sought to address this, yet requires computationally expensive full-parameter fine-tuning. Instead, we propose cross-calibration, a novel training-free approach for improving the domain performance of compressed LMs. Our approach effectively leverages Hessian-based sensitivity to identify weights that are influential for both in-domain and general performance. Through extensive experimentation, we demonstrate that cross-calibration substantially outperforms existing approaches on domain-specific tasks, without compromising general performance. Notably, these gains come without additional computational overhead, displaying remarkable potential for extracting domain-specialized compressed models from general-purpose LMs.
- Law (1.00)
- Health & Medicine (1.00)
- Government (1.00)
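The selection step at the heart of the abstract can be sketched as combining two per-weight sensitivity scores when deciding what a pruning method may remove. This is a hypothetical illustration only: the function name, the max-combination, and the threshold rule are assumptions, and the actual scores would come from the paper's Hessian-based sensitivity rather than being given as arrays:

```python
import numpy as np

def cross_calibrated_keep_mask(sens_in_domain, sens_general, sparsity):
    # Treat a weight as influential if either score marks it so,
    # protecting both domain-specific and general capabilities.
    combined = np.maximum(sens_in_domain, sens_general)
    # Prune the `sparsity` fraction of weights with the lowest
    # combined sensitivity; keep the rest (mask True = keep).
    k = int(combined.size * sparsity)
    threshold = np.partition(combined.ravel(), k)[k]
    return combined >= threshold
```

The resulting boolean mask would then be handed to any standard pruning backend, which keeps the approach training-free.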
Review for NeurIPS paper: On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law
Summary and Contributions: This paper provides an investigation of out-of-distribution generalization in visual question answering, as benchmarked by prior works on the VQA-CP dataset. The VQA-CP dataset by Agrawal et al. has different distributions in training and test, intentionally constructed to encourage models to truly perform reasoning and generalize better, instead of naively picking up on question-only biases in the dataset. However, the authors demonstrate how several prior works on VQA-CP have (inadvertently) gamed this evaluation dataset without necessarily making progress, by exploiting knowledge of how the train/test splits were constructed to build models that a) are conditioned on the question prefix (and so will only work well on the VQA-CP test set and will not generalize beyond it), or b) poorly fit the training set. Next, the authors provide a few naive baselines that exploit the aforementioned issues (and, as the authors acknowledge, are not useful for any practical purposes) yet perform well on the VQA-CP test set -- 1) a random-predictions model that inverts the predicted answer distribution from training to test, and 2) a learned BUTD model that artificially ignores the top-predicted answer on the VQA-CP test set. The fact that the inverted random-predictions model performs better on number and yes/no questions -- the question set that constitutes the largest fraction of performance -- is alarming, and provides a necessary and timely check on prior works on VQA-CP.
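The "inverted" random-predictions baseline the review describes can be sketched simply: estimate the answer prior from the training split, then predict in inverse proportion to it at test time, exploiting VQA-CP's deliberately flipped train/test priors. A hypothetical illustration, not the paper's exact construction:

```python
def inverted_answer_prior(train_answer_counts):
    # Empirical answer prior from the training split.
    total = sum(train_answer_counts.values())
    prior = {a: c / total for a, c in train_answer_counts.items()}
    # Invert: answers frequent in training become rare predictions.
    inv = {a: 1.0 - p for a, p in prior.items()}
    # Renormalize into a proper prediction distribution.
    z = sum(inv.values())
    return {a: w / z for a, w in inv.items()}
```

That such a content-blind baseline can score well on VQA-CP test is precisely the Goodhart's-law failure the paper highlights.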
Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders
Lee, Hyunji, Soldaini, Luca, Cohan, Arman, Seo, Minjoon, Lo, Kyle
Prevailing research practice today often relies on training dense retrievers on existing large datasets such as MSMARCO and then experimenting with ways to improve zero-shot generalization to unseen domains. While prior work has tackled this challenge through resource-intensive steps such as data augmentation, architectural modifications, increasing model size, or even further base-model pretraining, comparatively little investigation has examined whether the training procedures themselves can be improved to yield better generalization in the resulting models. In this work, we recommend a simple recipe for training dense encoders: train on MSMARCO with parameter-efficient methods, such as LoRA, and opt for in-batch negatives unless given well-constructed hard negatives. We validate these recommendations on the BEIR benchmark and find that the results hold across the choice of dense encoder and base model size, and are complementary to other resource-intensive strategies for out-of-domain generalization such as architectural modifications or additional pretraining. We hope that this thorough and impartial study of various training techniques, which augments other resource-intensive methods, offers practical insights for developing a dense retrieval model that generalizes effectively, even when trained on a single dataset. Dense neural retrieval methods have proven generally effective in many Information Retrieval (IR) tasks (Karpukhin et al., 2020; Izacard et al., 2021; Ni et al., 2021a). These methods use learned neural encoders to obtain dense vector representations of text, and the relevance of a passage for a given query is estimated by computing the dot product between their encodings. Dense approaches can outperform traditional retrieval techniques (e.g., BM25 (Robertson & Jones, 1976)), as they estimate similarity beyond syntactic matching (Lin et al., 2022).
Neural retrieval models are effective rankers in domains for which large supervised datasets exist (e.g., MSMARCO (Campos et al., 2016) or Google NQ (Kwiatkowski et al., 2019)). Conversely, they may struggle to generalize to settings they have not been trained on, leading to challenges in handling out-of-domain tasks (Thakur et al., 2021a; Ren et al., 2022; Lupart et al., 2023). In most real-world applications, supervision data is not available; meanwhile, retrieval models play a key role in the nascent field of augmented language models across many exciting new scenarios (Mialon et al., 2023).
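The dot-product relevance scoring and the in-batch-negatives objective the recipe recommends can be sketched as a standard contrastive loss: each query's positive passage sits at the same batch index, and every other passage in the batch serves as a negative. A minimal numpy sketch of the idea, not the authors' training code:

```python
import numpy as np

def in_batch_negatives_loss(q_emb, p_emb):
    # Dot-product relevance between every query and every passage
    # in the batch: scores[i, j] = q_i . p_j.
    scores = q_emb @ p_emb.T
    # Log-softmax over passages; the matching passage for query i
    # is at column i, so all other columns act as negatives.
    scores = scores - scores.max(axis=1, keepdims=True)  # stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the diagonal (the positives).
    return -np.mean(np.diag(log_probs))
```

Because the negatives come for free from the rest of the batch, no separate hard-negative mining is required, which is exactly why the recipe defaults to in-batch negatives unless well-constructed hard negatives are available.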